Abstract

Unsupervised models can provide supplementary soft con- 
straints to help classify new target data under the assump- 
tion that similar objects in the target set are more likely to 
share the same class label. Such models can also help de- 
tect possible differences between training and target distri- 
butions, which is useful in applications where concept drift 
may take place. This paper describes a Bayesian frame- 
work that takes as input class labels from existing classifiers 
(designed based on labeled data from the source domain), 
as well as cluster labels from a cluster ensemble operating 
solely on the target data to be classified, and yields a con- 
sensus labeling of the target data. This framework is partic- 
ularly useful when the statistics of the target data drift or 
change from those of the training data. We also show that 
the proposed framework is privacy-aware and allows per- 
forming distributed learning when data/models have shar- 
ing restrictions. Experiments show that our framework can 
yield superior results to those provided by applying classifier 
ensembles only. 

1 Introduction 

In several data mining applications, one builds an 
initial classification model that needs to be applied 
to unlabeled data acquired subsequently. Since the
statistics of the underlying phenomena being modeled
change with time, these classifiers may also need to
be occasionally rebuilt if performance degrades beyond
an acceptable level. In such situations, it is desirable
that the classifier functions well with as little labeling
of new data as possible, since labeling can be expensive
in terms of time and money, and is potentially
error-prone. Moreover, the classifier should be able
to adapt to changing statistics to some extent, given the
aforementioned constraints.

This paper addresses the problem of combining
multiple classifiers and clusterers in a fairly general
setting that includes the scenario sketched above. An
ensemble of classifiers is first learnt on an initial labeled 
training dataset after which the training data can be 

* University of Texas at Austin, Austin, TX, USA. Email:
{aacharya@, ghosh@ece}.utexas.edu

† University of Sao Paulo at Sao Carlos, Brazil. Email:
erh@icmc.usp.br

‡ eBay Research Lab, San Jose, CA, USA. Email: {bsarwar,
jruvini}@ebay.com



discarded. Subsequently, when new unlabeled target 
data is encountered, a cluster ensemble is applied to it, 
thereby generating cluster labels for the target data. 
The heart of our approach is a Bayesian framework 
that combines both sources of information (class/cluster 
labels) to yield a consensus labeling of the target data. 

The setting described above is, in principle, differ- 
ent from transductive learning setups where both la- 
beled and unlabeled data are available at the same time 
for model building [15], as well as online methods [BJ. 
Additional differences from existing approaches are de- 
scribed in the section on related works. For the moment 
we note that the underlying assumption is that similar 
new objects in the target set are more likely to share 
the same class label. Thus, the supplementary con- 
straints provided by the cluster ensemble can be use- 
ful for improving the generalization capability of the 
resulting classifier system. Also, these supplementary
constraints can be useful for designing learning methods
that help determine differences between training and
target distributions, making the overall system more
robust against concept drift.

We also show that our approach can combine cluster 
and classifier ensembles in a privacy-preserving setting. 
This approach can be useful in a variety of applications. 
For example, the data sites can represent parties that 
are a group of banks, with their own sets of customers, 
who would like to have a better insight into the behavior 
of the entire customer population without compromis- 
ing the privacy of their individual customers. 

The remainder of the paper is organized as follows.
The next section addresses related work. The proposed
Bayesian framework, named BC³E (from Bayesian
Combination of Classifiers and Clusterer Ensembles),
is described in Section 3. Issues with privacy
preservation are discussed in Section 4, and the
experimental results are reported in Section 5. Finally,
Section 6 concludes the paper.

2 Related Work 

Combining multiple classifiers into an ensemble has
been shown to be more useful than using individual
classifiers [17]. Analogously, several research efforts
have shown that cluster ensembles can improve the
quality of results as compared to a single clusterer
(e.g., see [5T] and references therein).
Most of the motivations for combining ensembles of clas- 
sifiers and clusterers are similar to those that hold for 
the standalone use of either classifier or cluster ensem- 
bles. Additionally, unsupervised models can provide 
supplementary constraints for classifying new data and 
thereby improve the generalization capability of the re- 
sulting classifier. These successes provide the motiva- 
tion for designing effective ways of leveraging both clas- 
sifier and cluster ensembles to solve challenging predic- 
tion problems. 

Specific mechanisms for combining classification
and clustering models, however, have been introduced
only recently: the Bipartite Graph-based Consensus
Maximization (BGCM) algorithm [15], the Locally
Weighted Ensemble (LWE) algorithm [13], and the
C³E algorithm [3]. Both BGCM and C³E have
parameters that control the relative importance of
classifiers and clusterers. In traditional semi-supervised
settings, such parameters can be optimized via
cross-validation. However, if the training and the target
distributions are different, cross-validation is not possible.
From this viewpoint, our approach (BC³E) can be seen
as an extension of C³E [5] that is capable of dealing
with this issue in a more principled way. In addition,
the algorithms in [13], [12], [3] do not deal with privacy
issues, whereas our probabilistic framework can combine
class labels with cluster labels under conditions where
sharing of individual records across data sites is not
permitted. It uses a soft probabilistic notion of privacy,
based on a quantifiable information-theoretic
formulation [TB]. Note that existing works on Bayesian
classifier ensembles (e.g., [TU1H1Q3]) do not deal with
privacy issues.

From the clustering side, the proposed model
borrows ideas from the Bayesian Cluster Ensemble [21].
In [1], we introduced some preliminary ideas that are
further developed in our current paper. In particular,
the algorithm in [1] is not capable of automatically
estimating the importance that classifiers and clusterers
should have. This property is fundamental for
applications where training and target distributions are
different. In addition, the Bayesian model presented here is
considerably different and requires more sophisticated
inference and estimation procedures.

3 Probabilistic Model 

We assume that a classifier ensemble has been (previ- 
ously) induced from a training set. At this point and 
assuming a non-transductive setting, the training data 



can be discarded if so desired. Such a classifier ensem- 
ble is employed to generate a number of class labels (one 
from each classifier) for every object in the target set. 
BC 3 E refines such classifier prediction with the help of 
a cluster ensemble. Each base clustering algorithm that 
is part of the ensemble partitions the target set, pro- 
viding cluster labels for each of its objects. From this 
point of view, the cluster ensemble provides supplemen- 
tary constraints for classifying those objects, with the 
rationale that similar objects — those that are likely 
to be clustered together across (most of) the partitions 
that form the cluster ensemble — are more likely to 
share the same class label. 

Consider a target set $X = \{x_n\}_{n=1}^{N}$ formed by $N$
unlabeled objects. A classifier ensemble composed of
$r_1$ models has produced $r_1$ class labels for every object
$x_n \in X$. It is assumed that the target objects belong
to $k$ classes denoted by $C = \{C_i\}_{i=1}^{k}$ and that at least
one object from each of these classes was observed in
the training phase (i.e., we do not consider "novel"
classes in the target set). Similarly, consider that a
cluster ensemble comprised of $r_2$ clustering algorithms
has generated cluster labels for every object in the target
set. The number of clusters need not be the same across
different clustering algorithms. Also, it should be noted
that the cluster labeled as 1 in a given data partition
may not align with the cluster numbered 1 in another
partition, and none of these clusters may correspond to
class 1. Given the class and cluster labels, the objective
is to come up with refined class probability distributions
$\{(P(C_i \mid x_n))_{i=1}^{k} = y_n\}_{n=1}^{N}$ of the target set objects. This
framework is illustrated in Fig. 1.

The observed class and cluster labels are represented
as $W = \{\{w_{1nl}\}, \{w_{2nm}\}\}$, where $w_{1nl}$ is the
1-of-$k$ representation of the class label of the $n$-th object
given by the $l$-th classifier, and $w_{2nm}$ is the 1-of-$k^{(m)}$
representation of the cluster label assigned to the $n$-th
object by the $m$-th clusterer. A generative model is proposed
to explain the observations $W$, where each object $x_n$ has an
underlying mixed-membership to the $k$ different classes.
Let $f(y_n)$ denote the latent mixed-membership vector
for $x_n$, where $f(x) = \exp(x) / \sum_{i} \exp(x_i)$ is the softmax
function. $y_n$ is sampled from a normal distribution $\mathcal{N}(\mu, \Sigma)$.
Also, corresponding to the $i$-th class and $m$-th base
clustering, we assume a multinomial distribution $\beta_{mi}$ over
the cluster labels of the $m$-th base clustering. Therefore,
$\beta_{mi}$ is of dimension $k^{(m)}$ and $\sum_{j=1}^{k^{(m)}} \beta_{mij} = 1$ if the $m$-th
base clustering has $k^{(m)}$ clusters. The data generative
process, whose corresponding graphical model is shown
in Fig. 2, can be summarized as follows.
For each $x_n \in X$:

1. Choose $y_n \sim \mathcal{N}(\mu, \Sigma)$, where $\mu \in \mathbb{R}^{k}$ is the mean
and $\Sigma \in \mathbb{R}^{k \times k}$ is the covariance.

2. Choose $\theta_n \sim \mathcal{N}(y_n, \delta^2 I_k)$, where $\delta^2 > 0$ is the
scaling factor of the covariance of the normal
distribution centered at $y_n$, and $I_k$ is the identity
$k \times k$ matrix.

3. $\forall l \in \{1, 2, \dots, r_1\}$, choose $w_{1nl} \sim f(y_n)$.

4. $\forall m \in \{1, 2, \dots, r_2\}$:

(a) Choose $z_{nm} \sim f(\theta_n)$, where $z_{nm}$ is a
$k$-dimensional vector with 1-of-$k$ representation.

(b) Choose $w_{2nm} \sim \text{multinomial}(\beta_{m z_{nm}})$.

Figure 1: Combining Classifiers and Clusterers. [Diagram: classifier(s) built on the source data produce class labels $\{w_{1n}\}$ for the target data (new unlabeled data $\{x_n\}$); a cluster ensemble produces cluster labels $\{w_{2n}\}$; BC³E combines both into $\{P(C \mid x_n)\}$.]

Figure 2: Graphical Model for BC³E
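The generative steps above can be sketched in a few lines of Python. This is only an illustrative sampler, not the authors' implementation: it assumes a diagonal covariance for $\Sigma$ (as is later assumed for inference), and the names `sample_object`, `sample_categorical`, and the nested-list layout of `beta` are hypothetical choices made for this sketch.

```python
import math
import random


def softmax(v):
    """Numerically stable softmax f(.) used for the mixed-membership vector."""
    top = max(v)
    e = [math.exp(x - top) for x in v]
    s = sum(e)
    return [x / s for x in e]


def sample_categorical(p, rng):
    """Draw an index from the discrete distribution p."""
    u, acc = rng.random(), 0.0
    for i, p_i in enumerate(p):
        acc += p_i
        if u < acc:
            return i
    return len(p) - 1


def sample_object(mu, sigma2, delta2, beta, r1, rng):
    """Sample the labels of one target object.

    mu, sigma2 : mean and (assumed diagonal) variances of y_n, length k
    delta2     : scaling of the covariance of theta_n around y_n
    beta[m][i] : distribution over the clusters of clustering m for class i
    r1         : number of classifiers
    """
    k = len(mu)
    y = [rng.gauss(mu[i], math.sqrt(sigma2[i])) for i in range(k)]      # step 1
    theta = [rng.gauss(y[i], math.sqrt(delta2)) for i in range(k)]      # step 2
    w1 = [sample_categorical(softmax(y), rng) for _ in range(r1)]       # step 3
    w2 = []
    for beta_m in beta:                                                 # step 4
        z = sample_categorical(softmax(theta), rng)                     # 4(a)
        w2.append(sample_categorical(beta_m[z], rng))                   # 4(b)
    return y, theta, w1, w2
```

Labels are returned as indices rather than 1-of-$k$ vectors to keep the sketch short.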

The observed class labels $\{w_{1nl}\}$ are assumed to
be sampled from the latent mixed-membership vector
$f(y_n)$. If the $n$-th object is sampled from the $i$-th class
in the $m$-th base clustering (implying $z_{nmi} = 1$), then
its cluster label will be sampled from the multinomial
distribution $\beta_{mi}$. This particular generative process
is analogous to the one used by the Bayesian Cluster
Ensemble in [3T]. The fact that $\theta_n$ is sampled from
$\mathcal{N}(y_n, \delta^2 I_k)$ needs further clarification. In practice, the
observed class labels and cluster labels carry different
intrinsic weights. If the observations from the classifiers
are assigned too much weight compared to those
from clustering, there is little hope for the clustering
to enhance classification. Similarly, if the observations
from the clustering are given too much importance,
the classification performance might deteriorate.
Ideally, the unsupervised information is only expected to
enhance the classification accuracy.

Aimed at building a "safe" model that can intelligently
utilize or reject the unsupervised information,
$\theta_n$ is sampled from $\mathcal{N}(y_n, \delta^2 I_k)$, where the parameter
$\delta$ decides how much the observations from the clusterings
can be trusted. If $\delta^2$ is a large positive number,
$y_n$ does not have to explain the posterior of $\theta_n$. From
the generative model perspective, this means that the
sampled value of $\theta_n$ is not governed by $y_n$ anymore, as
the distribution has very large variance. On the other
hand, if $\delta^2$ is a small positive number, $y_n$ has to
explain the posterior of $\theta_n$ and hence the observations
from the clustering. Therefore, the posteriors of $\{y_n\}$
are expected to be more accurate than if they only had
to explain the classification results. A concrete
quantitative argument for this intuitive statement
will be presented later.
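The role of $\delta^2$ can be illustrated with a deliberately simplified calculation. If the classifier labels are ignored and $\Sigma$ is taken to be diagonal (both assumptions of this sketch, not of the model), then for one coordinate the Gaussian prior $y_{ni} \sim \mathcal{N}(\mu_i, \sigma_i^2)$ and the link $\theta_{ni} \sim \mathcal{N}(y_{ni}, \delta^2)$ give a posterior mean of $y_{ni}$ that is a precision-weighted average of $\mu_i$ and $\theta_{ni}$. The function name below is hypothetical.

```python
def gaussian_only_posterior_mean(prior_mean, prior_var, theta, delta2):
    """Posterior mean of y given theta when y ~ N(prior_mean, prior_var)
    and theta ~ N(y, delta2); classifier votes are ignored in this sketch."""
    w_prior = 1.0 / prior_var   # precision of the prior on y
    w_clust = 1.0 / delta2      # precision of the link between y and theta
    return (w_prior * prior_mean + w_clust * theta) / (w_prior + w_clust)


# Small delta2: y is pulled towards theta (clustering side is trusted).
# Large delta2: y stays at its prior mean (clustering side is ignored).
for delta2 in (1e-3, 1.0, 1e3):
    print(delta2, gaussian_only_posterior_mean(0.0, 1.0, 2.0, delta2))
```

In the full model the classifier votes add further terms, so this only shows the direction of the effect.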

To address the log-likelihood function of BC³E,
let us denote the set of hidden variables by $Z =
\{\{y_n\}, \{\theta_n\}, \{z_{nm}\}\}$. The model parameters can
conveniently be represented by $\zeta_0 = \{\mu, \Sigma, \delta^2, \{\beta_{mij}\}\}$. The
joint distribution of the hidden and observed variables
can be written as:

(3.1)  $p(X, Z \mid \zeta_0) = \prod_{n=1}^{N} p(y_n \mid \mu, \Sigma)\, p(\theta_n \mid y_n, \delta^2 I_k) \prod_{l=1}^{r_1} p(w_{1nl} \mid f(y_n)) \prod_{m=1}^{r_2} p(z_{nm} \mid f(\theta_n))\, p(w_{2nm} \mid \beta, z_{nm})$

The inference and estimation is performed using
Variational Expectation-Maximization (VEM) to avoid
computational intractability due to the coupling between
$\theta$ and $\beta$.

3.1 Approximate Inference and Estimation:

3.1.1 Inference: To obtain a tractable lower bound
on the observed log-likelihood, we specify a fully
factorized distribution to approximate the true posterior of
the hidden variables:

(3.2)  $q(Z \mid \{\zeta_n\}_{n=1}^{N}) = \prod_{n=1}^{N} q(y_n \mid \mu_n, \Sigma_n)\, q(\theta_n \mid \epsilon_n, \Lambda_n) \prod_{m=1}^{r_2} q(z_{nm} \mid \phi_{nm})$

where $y_n \sim \mathcal{N}(\mu_n, \Sigma_n)$, $\theta_n \sim \mathcal{N}(\epsilon_n, \Lambda_n)$ $\forall n \in
\{1, 2, \dots, N\}$, $z_{nm} \sim \text{multinomial}(\phi_{nm})$ $\forall n \in
\{1, 2, \dots, N\}$ and $\forall m \in \{1, 2, \dots, r_2\}$, and $\zeta_n =
\{\mu_n, \Sigma_n, \epsilon_n, \Lambda_n, \{\phi_{nm}\}\}$ is the set of variational
parameters corresponding to the $n$-th object. Further,
$\mu_n, \epsilon_n \in \mathbb{R}^{k}$ and $\Sigma_n, \Lambda_n \in \mathbb{R}^{k \times k}$ $\forall n$, and $\phi_{nm} =
(\phi_{nmi})_{i=1}^{k}$ $\forall n, m$, where the components of the
corresponding vectors are made explicit. To work with
fewer parameters, all the covariance matrices are
assumed to be diagonal. Therefore, $\Sigma = \text{diag}((\sigma_i^2)_{i=1}^{k})$,
$\Sigma_n = \text{diag}((\sigma_{ni}^2)_{i=1}^{k})$, and $\Lambda_n = \text{diag}((\lambda_{ni}^2)_{i=1}^{k})$.
Using Jensen's inequality, a lower bound on the observed
log-likelihood can be derived as:



(3.3)  $\log[p(X \mid \zeta_0)] \geq \mathbb{E}_{q(Z)}[\log[p(X, Z \mid \zeta_0)]] + H(q(Z))$

where $H(q(Z)) = -\mathbb{E}_{q(Z)}[\log[q(Z)]]$ is the entropy of
the variational distribution $q(Z)$, and $\mathbb{E}_{q(Z)}[\cdot]$ is the
expectation w.r.t. $q(Z)$.

Let $Q$ be the set of all distributions having a
fully factorized form as given in (3.2). The optimal
distribution that produces the tightest possible lower
bound $\mathcal{L}$ is given by:

(3.4)  $q^{*} = \arg\min_{q \in Q} \mathrm{KL}(q(Z) \,\|\, p(Z \mid X, \zeta_0))$.

In equations (4), (6), (8), (10), (12), (13), and (14) in
Table 1, the optimal values of the variational parameters
that satisfy (3.4) are presented. Since the logistic
normal distribution is not conjugate to the multinomial,
not all of the update equations can be obtained in
closed form. For the parameters that do not have a
closed-form solution for the update, we just present the
part of the objective function that depends on the
concerned parameter, and some numeric optimization
method has to be used for optimizing the lower bound.
Since $\phi_{nm}$ is a multinomial distribution, the updated
values of its $k$ components should be normalized to unity.
Note that the optimal value of one of the variational
parameters depends on the others and, therefore, an
iterative optimization is adopted to maximize the lower
bound till convergence is achieved.

Equations (6) and (8) present updates for two
new parameters, $\kappa_n$ and $\xi_n$. These parameters come from
$\mathbb{E}_q[\log p(w_{1nl} \mid f(y_n))]$ and $\mathbb{E}_q[\log p(z_{nm} \mid f(\theta_n))]$,
respectively. Neither of these integrals has an analytic
solution and hence a first-order Taylor approximation
is utilized, as also done in [5]. A closer inspection
of (12) reveals that $\delta^2$ appears in the denominator of
the term $\sum_{i=1}^{k} (\mu_{ni} - \epsilon_{ni})^2 / \delta^2$ in the objective. Hence,
larger values of $\delta^2$ will nullify any effect from $\epsilon_n$ which,
in turn, is affected by the observations $\{w_{2nm}\}$ (as is
obvious from (14)). On the other hand, if $\delta^2$ is small
enough, $\epsilon_n$ can strongly impact the values of $\mu_n$.

3.1.2 Estimation: For estimation, we maximize the
optimized lower bound obtained from the variational
inference w.r.t. the free model parameters $\zeta_0$ (keeping
the variational parameters fixed). The optimal values
of the model parameters are presented in equations (5),
(7), and (9). Since $\beta_{mi}$ is a multinomial distribution,
the updated values of its $k^{(m)}$ components should be
normalized to unity. However, no closed form of update
exists for $\sigma^2$, and a numeric optimization method has
to be resorted to. The part of the objective function
that depends on $\sigma^2$ is provided in Eq. (11). Once
the optimization in the M-step is done, the E-step starts
again and the iterative update is continued till
convergence. The variational parameters $\{\mu_n\}_{n=1}^{N}$ are
then inspected; they serve as a proxy for the refined
posterior estimates of $\{y_n\}_{n=1}^{N}$. The main steps of
inference and estimation are concisely presented in
Algorithm 1.
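The first-order Taylor approximation mentioned in the inference step can be checked numerically. The sketch below assumes the standard bound $\log x \leq \log \kappa + x/\kappa - 1$ applied to $x = \sum_i \exp(y_{ni})$ with independent Gaussian coordinates; the function names are hypothetical and the Monte Carlo estimate is included only as a sanity check.

```python
import math
import random


def taylor_upper_bound(mu, sigma2, kappa):
    """Upper bound on E[log sum_i exp(y_i)] for y_i ~ N(mu_i, sigma2_i),
    obtained from log(x) <= log(kappa) + x / kappa - 1."""
    s = sum(math.exp(m + v / 2.0) for m, v in zip(mu, sigma2))
    return math.log(kappa) - 1.0 + s / kappa


def optimal_kappa(mu, sigma2):
    """Value of kappa that makes the bound tightest (cf. Eq. (6))."""
    return sum(math.exp(m + v / 2.0) for m, v in zip(mu, sigma2))


def monte_carlo_expectation(mu, sigma2, n_samples, rng):
    """Monte Carlo estimate of E[log sum_i exp(y_i)]."""
    total = 0.0
    for _ in range(n_samples):
        ys = [rng.gauss(m, math.sqrt(v)) for m, v in zip(mu, sigma2)]
        top = max(ys)
        total += top + math.log(sum(math.exp(y - top) for y in ys))
    return total / n_samples
```

At the optimal $\kappa$ the bound reduces to $\log \kappa$, which is why the update for $\kappa_n$ has a closed form while those for $\mu_n$ and $\sigma_n$ do not.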

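Algorithm 1 Learning BC³E
Input: $W$.
Output: $\zeta_0$, $\{\mu_n\}_{n=1}^{N}$.

Initialize $\zeta_0$, $\{\zeta_n\}_{n=1}^{N}$.
Until Convergence
  E-Step
    Until Convergence
      1. Update $\kappa_n$ using Eq. (6) $\forall n \in \{1, 2, \dots, N\}$.
      2. Update $\xi_n$ using Eq. (8) $\forall n \in \{1, 2, \dots, N\}$.
      3. Update $\phi_{nmi}$ using Eq. (4) $\forall n, m, i$. Normalize $\phi_{nm}$.
      4. Maximize (12) w.r.t. $\mu_n$ $\forall n$.
      5. Maximize (13) w.r.t. $\sigma_{ni}^2 > 0$ $\forall n, i$.
      6. Maximize (14) w.r.t. $\epsilon_n$ $\forall n$.
      7. Maximize (10) w.r.t. $\lambda_{ni}^2 > 0$ $\forall n, i$.
  M-Step
      8. Update $\mu$ using Eq. (5).
      9. Update $\delta^2$ using Eq. (9).
      10. Update $\beta_{mij}$ using Eq. (7) $\forall m, i, j$. Normalize $\beta_{mi}$.
      11. Maximize (11) w.r.t. $\sigma_i^2 > 0$ $\forall i$.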
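The closed-form steps of Algorithm 1 (steps 1 to 3 and 8 to 10) can be sketched as plain functions. This is a reading of the update equations under stated assumptions rather than the authors' code: labels are passed as indices instead of 1-of-$k$ vectors, variances are stored as `sigma2`/`lam2` lists, the $\phi$ update uses $\log \beta_{mij}$ as follows from the multinomial likelihood, and the small `floor` added in `update_beta` to avoid zero probabilities is a hypothetical safeguard. The numerically optimized steps (4 to 7 and 11) are not covered.

```python
import math


def update_kappa(mu_n, sigma2_n):
    """Eq. (6): kappa_n = sum_i exp(mu_ni + sigma_ni^2 / 2)."""
    return sum(math.exp(m + s / 2.0) for m, s in zip(mu_n, sigma2_n))


def update_xi(eps_n, lam2_n):
    """Eq. (8): xi_n = sum_i exp(eps_ni + lambda_ni^2 / 2)."""
    return sum(math.exp(e + l / 2.0) for e, l in zip(eps_n, lam2_n))


def update_phi(eps_n, beta_m, cluster_label):
    """Eq. (4) for one object and one clustering, normalized over classes.
    beta_m[i][j] is the probability of cluster j given class i."""
    logits = [e + math.log(beta_m[i][cluster_label]) for i, e in enumerate(eps_n)]
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def update_mu(mu_all):
    """Eq. (5): average of the variational means over all objects."""
    n = len(mu_all)
    return [sum(col) / n for col in zip(*mu_all)]


def update_delta2(mu_all, eps_all, sigma2_all, lam2_all):
    """Eq. (9): average of (eps - mu)^2 + lambda^2 + sigma^2 over n and i."""
    n, k = len(mu_all), len(mu_all[0])
    total = 0.0
    for mu_n, eps_n, s2_n, l2_n in zip(mu_all, eps_all, sigma2_all, lam2_all):
        for m, e, s2, l2 in zip(mu_n, eps_n, s2_n, l2_n):
            total += (e - m) ** 2 + l2 + s2
    return total / (n * k)


def update_beta(phi_m, labels_m, n_clusters, floor=1e-12):
    """Eq. (7): beta_mij proportional to sum_n phi_nmi * [label_n == j]."""
    k = len(phi_m[0])
    counts = [[floor] * n_clusters for _ in range(k)]
    for phi_n, j in zip(phi_m, labels_m):
        for i in range(k):
            counts[i][j] += phi_n[i]
    return [[c / sum(row) for c in row] for row in counts]
```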



4 Privacy Preserving Learning 

Most of the privacy-aware distributed data mining tech- 
niques developed so far have focused on classifica- 
tion or on association rules [H 111] , There has also 
been some work on distributed clustering for vertically
partitioned data (different sites contain different
attributes/features of a common set of records/objects)
[15], and on parallelizing clustering algorithms for
horizontally partitioned data (i.e., the objects are distributed
amongst the sites, which record the same set of features 
for each object) 0. These techniques, however, do not 



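Update Equations

$\phi_{nmi}^{*} \propto \exp\left(\epsilon_{ni} + \sum_{j=1}^{k^{(m)}} w_{2nmj} \log \beta_{mij}\right)$ $\forall n, m, i$.  (4)

$\mu^{*} = \frac{1}{N} \sum_{n=1}^{N} \mu_n$.  (5)

$\kappa_n^{*} = \sum_{i=1}^{k} \exp(\mu_{ni} + \sigma_{ni}^2 / 2)$ $\forall n$.  (6)

$\beta_{mij}^{*} \propto \sum_{n=1}^{N} \phi_{nmi} w_{2nmj}$ $\forall j \in \{1, 2, \dots, k^{(m)}\}$.  (7)

$\xi_n^{*} = \sum_{i=1}^{k} \exp(\epsilon_{ni} + \lambda_{ni}^2 / 2)$ $\forall n$.  (8)

$\delta^{2*} = \frac{1}{Nk} \sum_{n=1}^{N} \sum_{i=1}^{k} \left[ (\epsilon_{ni} - \mu_{ni})^2 + \lambda_{ni}^2 + \sigma_{ni}^2 \right]$.  (9)

$\mathcal{L}[\lambda_n] = -\frac{1}{2\delta^2} \sum_{i=1}^{k} \lambda_{ni}^2 + \frac{1}{2} \sum_{i=1}^{k} \log(\lambda_{ni}^2) - \frac{r_2}{\xi_n} \sum_{i=1}^{k} \exp(\epsilon_{ni} + \lambda_{ni}^2 / 2)$.  (10)

$\mathcal{L}[\sigma^2] = -\frac{N}{2} \sum_{i=1}^{k} \log(\sigma_i^2) - \sum_{n=1}^{N} \sum_{i=1}^{k} \left[ \frac{\sigma_{ni}^2}{2\sigma_i^2} + \frac{(\mu_{ni} - \mu_i)^2}{2\sigma_i^2} \right]$.  (11)

$\mathcal{L}[\mu_n] = -\frac{1}{2} \sum_{i=1}^{k} \frac{(\mu_{ni} - \mu_i)^2}{\sigma_i^2} - \frac{1}{2\delta^2} \sum_{i=1}^{k} (\mu_{ni} - \epsilon_{ni})^2 + \sum_{l=1}^{r_1} \sum_{i=1}^{k} w_{1nli}\, \mu_{ni} - \frac{r_1}{\kappa_n} \sum_{i=1}^{k} \exp(\mu_{ni} + \sigma_{ni}^2 / 2)$.  (12)

$\mathcal{L}[\sigma_n] = -\frac{1}{2} \sum_{i=1}^{k} \frac{\sigma_{ni}^2}{\sigma_i^2} + \frac{1}{2} \sum_{i=1}^{k} \log(\sigma_{ni}^2) - \frac{1}{2\delta^2} \sum_{i=1}^{k} \sigma_{ni}^2 - \frac{r_1}{\kappa_n} \sum_{i=1}^{k} \exp(\mu_{ni} + \sigma_{ni}^2 / 2)$.  (13)

$\mathcal{L}[\epsilon_n] = \sum_{m=1}^{r_2} \sum_{i=1}^{k} \phi_{nmi}\, \epsilon_{ni} - \frac{r_2}{\xi_n} \sum_{i=1}^{k} \exp(\epsilon_{ni} + \lambda_{ni}^2 / 2) - \frac{1}{2\delta^2} \sum_{i=1}^{k} (\epsilon_{ni} - \mu_{ni})^2$.  (14)

Table 1: Equations for update of variational and model parameters in BC³E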



specifically address privacy issues, other than through 
encryption [20] . 

This is also true of earlier, data-parallel methods 
[5] that are susceptible to privacy breaches, and also 
need a central planner that dictates what algorithm 
runs on each site. Finally, recent works on distributed 
differential privacy focus on query processing rather 
than data mining [7]. 

In the sequel, we show that the inference and
estimation in BC³E using VEM allows solving the
cluster ensemble problem in a way that preserves privacy.
Depending on how the objects with their cluster/class
labels are distributed across different "data sites", we can
have three scenarios: i) Row Distributed Ensemble, ii)
Column Distributed Ensemble, and iii) Arbitrarily
Distributed Ensemble.

4.1 Row Distributed Ensemble: In the row
distributed ensemble learning framework, the test set $X$ is
partitioned into $D$ parts and different parts are assumed
to be at different locations. The objects from partition
$d$ are denoted by $X_d$ so that $X = \bigcup_{d=1}^{D} X_d$. Now, a
careful look at the E-step equations reveals that the update
of variational parameters corresponding to each object
in a given iteration is independent of those of other
objects. Therefore, we can maintain a client-server based
framework where the server only updates the model
parameters (in the M-step) and the clients (there should
be as many clients as there are distributed data sites)
update the variational parameters.



For instance, consider a situation where a dataset
is partitioned into two subsets $X_1$ and $X_2$ and these two
subsets are located in two different data sites. Data
site 1 has access to $X_1$ and a set of clustering and
classification results pertaining to objects belonging to
$X_1$. Similarly, data site 2 has access to $X_2$ and a set of
clustering and classification results corresponding to $X_2$.
Further assume that a set of distributed classification
(clustering) algorithms were used to generate the class
(cluster) labels of the objects belonging to each set.
Now, data site 1 can update the variational parameters
$\zeta_n$, $\forall x_n \in X_1$. Similarly, data site 2 can update the
variational parameters for all objects $x_n \in X_2$. Once the
variational parameters are updated in the E-step, the
server gathers information from the two sites and updates
the model parameters. Now, a closer inspection of
the M-step update equations reveals that each of them
contains a summation over the objects. Therefore,
individual data sites can send only some collective
information to the server without transgressing privacy.
For example, consider the update equation for $\beta_{mij}$. Eq.
(7) can be broken as follows:

(4.15)  $\beta_{mij}^{*} \propto \sum_{x_n \in X_1} \phi_{nmi} w_{2nmj} + \sum_{x_n \in X_2} \phi_{nmi} w_{2nmj}$

The first and second terms can be calculated in data
sites 1 and 2 separately and sent to the server, where
the two terms can be added and $\beta_{mij}$ can be updated
$\forall m, i, j$. Similarly, the other M-step update equations
(performed by the server in an analogous way) also do
not reveal any information about class or cluster labels
of objects belonging to different data sites.
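The decomposition in Eq. (4.15) can be sketched as follows for a single base clustering. The function names and the list-based data layout are hypothetical; cluster labels are passed as indices rather than 1-of-$k^{(m)}$ vectors, and the server step assumes every class receives some mass so that the normalization is well defined.

```python
def local_partial_sums(phi_site, labels_site, n_classes, n_clusters):
    """Computed inside one data site: S[i][j] = sum over local objects of
    phi_n[i] * [label_n == j]. Only S leaves the site."""
    sums = [[0.0] * n_clusters for _ in range(n_classes)]
    for phi_n, j in zip(phi_site, labels_site):
        for i in range(n_classes):
            sums[i][j] += phi_n[i]
    return sums


def server_update_beta(partials):
    """Computed on the server: add the partial sums from all sites and
    normalize each row so that sum_j beta[i][j] = 1."""
    n_classes = len(partials[0])
    n_clusters = len(partials[0][0])
    beta = []
    for i in range(n_classes):
        row = [sum(p[i][j] for p in partials) for j in range(n_clusters)]
        total = sum(row)  # assumed to be positive in this sketch
        beta.append([v / total for v in row])
    return beta
```

The server sees only the aggregated matrices, never the per-object $\phi_{nm}$ or cluster labels.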

4.2 Column Distributed Ensemble: In the column
distributed framework, different data sites share
the same set of objects but only a subset of base
clusterings or classification results are available to each data
site. For example, consider that we have two data sites
and four sets of class and cluster labels and each data
site has access to only two sets of classification or
clustering results. Assume that data site 1 has access to
the 1st and 2nd classification and clustering results and
data site 2 has access to the rest of the results. As in the
earlier case, a single server and two clients (corresponding
to two different data sites) are maintained. Since
each data site has access to all the objects, it is
necessary to share the variational parameters corresponding
to these objects. Therefore, $\{\kappa_n, \xi_n, \mu_n, \sigma_n, \epsilon_n, \lambda_n\}_{n=1}^{N}$
are all updated in the server (which is accessible from
each client).

The site (and object) specific variational parameters
$\{\phi_{nmi}\}$, however, cannot be shared and should be
updated in individual sites. This means that the
updates (6), (8), (10), (12), (13), and (14) should be
performed in the server. On the other hand, the update
for $\{\phi_{nmi}\}$ $\forall n, i$ and $m \in \{1, 2\}$ (corresponding to the
1st and 2nd clustering or classification results) should
be performed in data site 1. Similarly, the update for
$\{\phi_{nmi}\}$ $\forall n, i$ and $m \in \{3, 4\}$ has to be performed in data
site 2. However, while updating $\{\mu_n\}$, the calculation of
the term $\sum_{l=1}^{r_1} \sum_{i=1}^{k} w_{1nli}\, \mu_{ni}$ has to be performed without
revealing the class labels $\{w_{1nl}\}$ to the server. To that
end, it can be rewritten as:

(4.16)  $\sum_{l=1}^{r_1} \sum_{i=1}^{k} w_{1nli}\, \mu_{ni} = \sum_{l=1}^{2} \sum_{i=1}^{k} w_{1nli}\, \mu_{ni} + \sum_{l=3}^{4} \sum_{i=1}^{k} w_{1nli}\, \mu_{ni}$,

where the first term can be computed in data site 1 and
the second term can be computed by data site 2 and
then can be added in the server. It can be seen that
$\{w_{1nl}\}$ can never be recovered by the server and hence
privacy is ensured in the updates of the E-step. Except
for $\{\beta_{mij}\}$, all other model parameters can be updated
in the server in the M-step. However, the parameters
$\{\beta_{mij}\}$ have to be updated separately inside the clients.
Since $\{\beta_{mij}\}$ do not appear in any update equation
performed in the server, there is no need to send these
parameters to the server either. Therefore, in essence,
the clients update the parameters $\{\phi_{nmi}\}$ and $\{\beta_{mij}\}$ in
the E-step and M-step respectively, and the server updates
the remaining parameters.
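One way to read Eq. (4.16) is that each site only needs to send, for every object, how many of its local classifiers voted for each class. The sketch below makes that reading explicit; it is an assumption about how the masked message could look, and the function names and data layout are hypothetical.

```python
def site_vote_counts(labels_by_classifier, n_classes):
    """Computed inside one data site.
    labels_by_classifier[l][n] is the class index predicted by local
    classifier l for object n. Returns counts[n][i], the number of local
    classifiers voting class i for object n."""
    n_objects = len(labels_by_classifier[0])
    counts = [[0] * n_classes for _ in range(n_objects)]
    for labels in labels_by_classifier:
        for n, c in enumerate(labels):
            counts[n][c] += 1
    return counts


def server_linear_term(site_counts, mu):
    """Computed on the server: sum_l sum_i w_1nli * mu_ni for every object n,
    using only the per-site vote counts."""
    out = []
    for n, mu_n in enumerate(mu):
        total = [sum(counts[n][i] for counts in site_counts)
                 for i in range(len(mu_n))]
        out.append(sum(t * m for t, m in zip(total, mu_n)))
    return out
```

Only aggregated counts leave a site, so the server cannot tell which local classifier produced which label.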



4.3 Arbitrarily Distributed Ensemble: In an
arbitrarily distributed ensemble, each data site has access
to only a subset of the data points or a subset of the
classification and clustering results. Fig. 3 shows a
situation with an arbitrarily distributed ensemble with six
data sites.

We now refer to Fig. 4 and explain the
privacy-preserving EM update for this setting. As before,
corresponding to each different data site, a client node is
created. Clients that share a subset of the objects should
have access to the variational parameters corresponding
to the common objects. To highlight the sharing of objects
by clients, the test set $X$ is partitioned into four subsets,
$X_1$, $X_2$, $X_3$, and $X_4$, as shown in Fig. 3. Similarly, the
columns are also partitioned into three subsets: $G_1$, $G_2$,
and $G_3$.

Now, corresponding to each row partition, an
"Auxiliary Server" (AS) node is created. Each AS updates
the variational parameters corresponding to a set of
shared objects. For example, in Fig. 4, $AS_1$ updates
the variational parameters corresponding to $X_1$ (using
equations (6), (8), (12), (13), (14), and (10)). However,
any variational parameter that is specific to both an
object and a column is updated separately inside the
corresponding client (and hence $AS_1$ is connected with $C_1$
and $C_2$). Therefore, $\{\phi_{nmi} : n \in X_1, m \in G_1\}$ are
updated inside client 1 and $\{\phi_{nmi} : n \in X_1, m \in G_2 \cup G_3\}$
are updated inside client 2 (using Eq. (4)). Once all
variational parameters are updated in the E-step, the
M-step starts. Corresponding to each column partition, an
"Auxiliary Client" (AC) node is created. For instance,
$AC_1$ updates the model parameters $\beta_{mij}$ (using Eq. (7)) which
are specific to columns belonging to $G_1$. Since $C_1$, $C_3$,
and $C_5$ share the columns from the subset $G_1$, $AC_1$ is
connected with these three nodes in Fig. 4. The
remaining model parameters are, however, updated in a
"Server" (using equations (5), (9), and (11)).

In Fig. 4, the bidirectional edges indicate that
messages are sent to and from the connecting nodes.
We have avoided separate arrows for each direction only
to keep the figure uncluttered. The edges are also
numbered near their origin. For a comprehensive
understanding of the privacy preservation, the messages
transferred through each edge are listed in
the supplementary material. The messages sent from
the auxiliary servers to the main server are of the form
given in Eq. (4.15) and are denoted as "partial sums".
Expectedly, messages sent out from a client node are
"masked" in such a way that no other node can decode
the cluster labels or class labels of points belonging to
that client. This approach is completely general and will
work for any arbitrarily partitioned ensemble given that
each partition contains at least two sets of classification
results. Note that the ACs and ASs are only helpful
in conceptual understanding of the parameter update
and sharing. In practice, there is no real need for
these extra storage devices/locations. Client nodes can
themselves take the place of ASs and ACs and even the
main server as long as the updates are performed in
proper sequence.¹

Figure 3: Arbitrarily Distributed Ensemble

Figure 4: Parameter Update for Arbitrarily Distributed Ensemble

5 Experiments 

In this section, two different sets of experiments are
reported. The first set is for transfer learning with a
text classification dataset from eBay Inc. The other set
is for non-transductive semi-supervised learning, where
some publicly available datasets are used to simulate
the working environment of BC³E.

5.1 Transfer Learning: To show the capability of
BC³E in solving transfer learning problems, we use
a large-scale text classification dataset from eBay Inc.
The training data consists of 83 million items sold over
a three-month period and the test set contains
several million items sold a few days after the training
period. More details about the dataset can be found
in [TJ]. eBay organizes items into a six-level category
structure where there are 39 top-level nodes called meta
categories and 20K+ bottom-level nodes called leaf
categories. The dataset is generated when users provide the



¹ Note that such a framework allows running the updates of
the same stage in parallel at different sites, thereby saving
computation time in large-scale implementations.



titles of items they intend to sell on eBay. Each title is
limited to 50 characters, based on which the user gets
recommendations of leaf categories the item should
belong to. Such categorization of the item helps a seller
list an item in the correct branch of the product list,
thereby allowing a buyer to search more easily through
the few million items sold via eBay every single day. A
carefully designed $k$-Nearest Neighbor ($k$-NN) classifier
(with the help of improved search engine algorithms)
categorizes each of the items in less than 100 ms [15].
However, due to the large number of categories (20K),
items belonging to similar types of categories often get
misclassified.

To avoid such confusion, larger categories are
formed by aggregating examples from categories which
are relatively difficult to separate. Such aggregation is
easy once the confusion matrix of the classification,
obtained from a development dataset, is partitioned and
strongly connected vertices (each vertex representing
one of the 20K leaf categories) are identified from the
confusion graph, thereby forming a set of cliques which
represent the large categories. Note that the large categories
so discovered might not follow the internal hierarchy
that is maintained. Next, clustering is performed
with examples belonging to each of the large categories
and the clustering results, along with the predictions
from $k$-NN classification, are fed to BC³E (and also
to its competitors, i.e., C³E, BGCM, and LWE). The
idea here is to first reduce the classification space and
then use unsupervised information to refine the
predictions from $k$-NN on a smaller number of categories. The
number of leaf categories belonging to such large
categories usually varies between 4 and 10.
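The grouping of hard-to-separate leaf categories can be illustrated with a small sketch. The paper does not spell out the partitioning procedure, so the code below swaps in a simple stand-in: two categories are linked when either is confused with the other at a rate above a hypothetical `threshold`, and connected components of the resulting graph play the role of the large categories. The function name and the threshold are assumptions of this sketch.

```python
def group_confusable_categories(confusion, threshold):
    """confusion[a][b] = number of development items of category a that were
    predicted as category b. Returns groups of category indices obtained as
    connected components of the thresholded confusion graph (union-find)."""
    n = len(confusion)
    parent = list(range(n))

    def find(a):
        # Find the representative of a, with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    row_totals = [sum(row) for row in confusion]
    for a in range(n):
        for b in range(a + 1, n):
            rate = 0.0
            if row_totals[a] > 0:
                rate = max(rate, confusion[a][b] / row_totals[a])
            if row_totals[b] > 0:
                rate = max(rate, confusion[b][a] / row_totals[b])
            if rate >= threshold:
                parent[find(a)] = find(b)  # merge the two components

    groups = {}
    for a in range(n):
        groups.setdefault(find(a), []).append(a)
    return sorted(groups.values())
```

For example, with four categories where only the first two are frequently confused, the first two end up in one group and the others remain singletons.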

However, the dataset is very dynamic and, typically
over a span of three months, 20% of new words
are added to the existing vocabulary. One can retrain
the existing $k$-NN classifier every three months, but
the training process requires collecting new labeled data,
which is time consuming and expensive. One can
additionally design classifiers to segregate examples
belonging to each of the large categories. However, such an
approach might not improve much upon the performance
of the initial $k$-NN classifier if the data changes so
frequently. Therefore, we require a system that can
adaptively predict newer examples without retraining the
existing classifier or employing another set of
classification algorithms. BC³E is useful in such settings. The
parameter $\delta$ can adjust the weights of the predictions from
classifiers and the unsupervised information. As the results
reported in Table 2 reveal, as long as the classification
performance is not that poor, BC³E can improve on
the performance of $k$-NN using the clustering ensemble.

The column "Group ID" denotes anonymized
groups representing different large categories. $|X|$ shows
the number of examples in the test data. The column
"C³E-Ideal" shows the performance of C³E if the
correct tuning parameter for C³E were known. For a
transfer learning problem, estimating such a tuning parameter
requires some labeled data from the target set, which is
not available in our setting. If the tuning parameter is
chosen from cross-validation on the training data, the
final prediction on the target set can be affected adversely
if the underlying distribution changes (and in fact it
does in our experiments). Therefore, we need to adopt
a fail-safe approach where we can do at least as well
as the $k$-NN prediction. The results reveal that BC³E
significantly outperforms BGCM and LWE, and
sometimes achieves as good a performance as C³E-Ideal (i.e.,
when the correct tuning parameter of C³E is known). The
performance of C³E-Ideal can essentially be considered
as the best accuracy one could achieve from the given
inputs (i.e., class and cluster labels) using other existing
algorithms (BGCM, LWE, C³E) that work on the
same design space. Though BGCM has a tuning
parameter, its variation did not affect performance much
and we just report results corresponding to a unity value
of this parameter.

5.2 Semi-supervised Learning: Six datasets are
used in our experiments for semi-supervised learning:
Half-Moon (a synthetic dataset with two half circles
representing two classes), Circles (another synthetic
dataset that has two-dimensional instances that form
two concentric circles, one for each class), and four
datasets from the Library for Support Vector Machines:
Pima Indians Diabetes, Heart, German Numer, and
Wine. In order to simulate semi-supervised settings
where there is a very limited amount of labeled
instances, small percentages (see the values reported in
Table 3) of the instances are randomly selected for
training, whereas the remaining instances are used for
testing (target set). We perform 20 trials for every
dataset. For running experiments with BGCM and
C³E, the parameters reported in [13] and [2] are used,
respectively. The parameters of BC³E are initialized
randomly and approximately 10 EM iterations are enough
to get the results reported in Table 3. The classifier
ensemble consists of a decision tree (C4.5), a linear
discriminant, and generalized logistic regression. Cluster
ensembles are generated by means of multiple runs of
$k$-means [2]. LWE [12] is better suited for transfer
learning applications and hence has been left out of the
comparison. The column "Best" in Table 3 refers to
the performance of the best classifier in the ensemble.
Note that BC³E has superior performance for the most
difficult problems, where one has an incentive to use a
more complex mechanism. Most importantly, BC³E
has the privacy-preserving property not present in any
of its counterparts.
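As an illustration of how a cluster ensemble from multiple $k$-means runs might be produced, the following is a minimal pure-Python sketch using Lloyd's algorithm with random initial centers. It is not the experimental code: the function names, the fixed iteration count, and the seeding scheme are assumptions, and cluster indices are not aligned across runs (consistent with the remark in Section 3 that cluster 1 of one partition need not match cluster 1 of another).

```python
import random


def kmeans_labels(points, k, rng, n_iter=20):
    """One run of Lloyd's algorithm; points is a list of equal-length tuples.
    Returns one cluster index per point."""
    centers = rng.sample(points, k)  # random distinct points as initial centers
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center in squared Euclidean distance.
        for n, p in enumerate(points):
            labels[n] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each non-empty cluster's center to its mean.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def cluster_ensemble(points, ks, seed=0):
    """One k-means partition per entry of ks; ks may repeat a value to get
    several runs with different random initializations."""
    rng = random.Random(seed)
    return [kmeans_labels(points, k, rng) for k in ks]
```

Each returned partition corresponds to one set of cluster labels $\{w_{2nm}\}$ for a fixed $m$.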

6 Conclusion and Future Work 

The BC³E model proposed in this paper has been
shown to be useful for difficult non-transductive
semi-supervised and transfer learning problems. A good
trade-off between accuracy and privacy has also been
established empirically, a property absent in any of BC³E's
competitors. With minor modification, BC³E can also
handle soft outputs from classification and clustering
ensembles, which can further improve the results.